Abstract

We consider the stochastic multi-armed bandit problem with a prior distribution on the
reward distributions. We show that for any prior distribution, the Thompson Sampling strategy
achieves a Bayesian regret bounded from above by $14\sqrt{nK}$. This result is unimprovable in
the sense that there exists a prior distribution such that any algorithm has a Bayesian regret
bounded from below by $\frac{1}{20}\sqrt{nK}$.

In this paper we are interested in the Bayesian multi-armed bandit problem, which can be
described as follows. Let $\pi_0$ be a known distribution over some set $\Theta$, and let $\theta$ be a random
variable distributed according to $\pi_0$. For $i \in [K]$, let $(X_{i,s})_{s \geq 1}$ be identically distributed
random variables taking values in $[0, 1]$ and which are independent conditionally on $\theta$. Denote
$\mu_i(\theta) := \mathbb{E}(X_{i,1} \mid \theta)$. Consider now an agent facing $K$ actions (or arms). At each time step
$t = 1, \ldots, n$, the agent pulls an arm $I_t \in [K]$. The agent receives the reward $X_{i,s}$ when
he pulls arm $i$ for the $s$-th time. The arm selection is based only on past observed rewards
and potentially on an external source of randomness. More formally, let $(U_s)_{s \geq 1}$ be an i.i.d.
sequence of random variables uniformly distributed on $[0, 1]$, and let $T_i(s) = \sum_{t=1}^{s} \mathbb{1}_{I_t = i}$; then
$I_t$ is a random variable measurable with respect to $\sigma(I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1}, T_{I_{t-1}}(t-1)}, U_t)$.
We measure the performance of the agent through the Bayesian regret defined as

$$R_n = \mathbb{E} \sum_{t=1}^{n} \Big( \max_{i \in [K]} \mu_i(\theta) - \mu_{I_t}(\theta) \Big),$$

where the expectation is taken with respect to the parameter $\theta$, the rewards $(X_{i,s})_{s \geq 1}$, and the
external source of randomness $(U_s)_{s \geq 1}$.

The multi-armed bandit problem has a long history and we refer the interested reader to
Bubeck and Cesa-Bianchi [2012] for a survey of this extensive literature. In this paper we
are interested in studying the Thompson Sampling strategy, which was proposed in the very
first paper on the multi-armed bandit problem, Thompson [1933]. The strategy can be described
very succinctly: let $\pi_t$ be the posterior distribution on $\theta$ given the history
$H_t = (I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1}, T_{I_{t-1}}(t-1)})$ of the algorithm up to the beginning of round $t$. Then
Thompson Sampling first draws a parameter $\theta_t$ from $\pi_t$ (independently from the past given $\pi_t$)
and it pulls $I_t \in \operatorname{argmax}_{i \in [K]} \mu_i(\theta_t)$.
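
As an illustration of the sampling rule just described, the following minimal sketch specializes it to one concrete setting that the text does not single out: Bernoulli rewards with independent uniform, i.e. Beta(1, 1), priors on the arm means, so that $\theta = (\mu_1, \ldots, \mu_K)$ and the posterior $\pi_t$ is a product of Beta distributions. The function names, the choice of prior, and the Monte Carlo estimate of the Bayesian regret are assumptions made for illustration only.

```python
import numpy as np


def thompson_sampling_bernoulli(mu, n, rng):
    """Run Thompson Sampling for n rounds on Bernoulli arms with means mu.

    Assumes independent Beta(1, 1) priors on the means (illustrative choice).
    Returns the realized pseudo-regret sum_t (max_i mu_i - mu_{I_t}).
    """
    K = len(mu)
    successes = np.zeros(K)
    failures = np.zeros(K)
    best = float(np.max(mu))
    regret = 0.0
    for _ in range(n):
        # Draw theta_t from the posterior pi_t: one Beta sample per arm.
        theta_t = rng.beta(successes + 1.0, failures + 1.0)
        # Pull I_t in argmax_i mu_i(theta_t); here mu_i(theta) = theta_i.
        arm = int(np.argmax(theta_t))
        reward = float(rng.random() < mu[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        regret += best - float(mu[arm])
    return regret


def estimate_bayesian_regret(K, n, runs, seed=0):
    """Monte Carlo estimate of the Bayesian regret under the uniform prior."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        mu = rng.random(K)  # theta drawn from pi_0 (uniform on [0, 1]^K)
        total += thompson_sampling_bernoulli(mu, n, rng)
    return total / runs
```

For example, `estimate_bayesian_regret(K=2, n=1000, runs=20)` averages the regret over 20 draws of $\theta$ from the assumed prior; such an estimate can be compared against $14\sqrt{nK}$, although a simulation of this kind is only a sanity check and not a substitute for the proof.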

Recently there has been a surge of interest in this simple policy, mainly because of its flexibility
to incorporate prior knowledge on the arms, see for example Chapelle and Li [2011]. For
a long time the theoretical properties of Thompson Sampling remained elusive. The specific
case of binary rewards with a Beta prior is now very well understood thanks to the papers
Agrawal and Goyal [2012a], Kaufmann et al. [2012], Agrawal and Goyal [2012b]. In particular
the last paper shows that in this specific setting the regret is bounded from above by
$C\sqrt{nK \log n}$ for some numerical constant $C > 0$. This result was greatly generalized¹ by
Russo and Roy [2013], who proved that in fact this is true for any prior distribution $\pi_0$. Precisely,
they show that Thompson Sampling always satisfies $R_n \leq 5\sqrt{nK \log n}$. Our main
result is to show that the extraneous logarithmic factor in these bounds can be removed by
using ideas reminiscent of the MOSS algorithm of Audibert and Bubeck [2009]. Precisely we
prove the following theorem.

Theorem 1 For any prior distribution $\pi_0$, Thompson Sampling satisfies

$$R_n \leq 14\sqrt{nK}.$$

Remark that the above result is unimprovable in the sense that there exist prior distributions
$\pi_0$ such that for any algorithm one has $R_n \geq \frac{1}{20}\sqrt{nK}$ (see e.g. [Theorem 3.5,
Bubeck and Cesa-Bianchi [2012]]). This theorem also implies an optimal rate of identification
for the best arm, see Bubeck et al. [2009] for more details on this.

Proof We decompose the proof into three steps. We denote $i^*(\theta) \in \operatorname{argmax}_{i \in [K]} \mu_i(\theta)$; in
particular one has $I_t = i^*(\theta_t)$.

Step 1: rewriting of the Bayesian regret in terms of upper confidence bounds. This step
is given by [Proposition 1, Russo and Roy [2013]], which we reprove for the sake of completeness.
Let $B_{i,t}$ be a random variable measurable with respect to $\sigma(H_t)$. Note that by definition $\theta_t$ and
$\theta$ are identically distributed conditionally on $H_t$. This implies by the tower rule:

$$\mathbb{E}\, B_{i^*(\theta),t} = \mathbb{E}\, B_{i^*(\theta_t),t} = \mathbb{E}\, B_{I_t,t}.$$

Thus we obtain:

$$\mathbb{E}\big(\mu_{i^*(\theta)}(\theta) - \mu_{I_t}(\theta)\big) = \mathbb{E}\big(\mu_{i^*(\theta)}(\theta) - B_{i^*(\theta),t}\big) + \mathbb{E}\big(B_{I_t,t} - \mu_{I_t}(\theta)\big).$$

Inspired by the MOSS strategy of Audibert and Bubeck [2009] we will now take

$$B_{i,t} = \hat{\mu}_{i,T_i(t-1)} + \sqrt{\frac{\log_+\!\Big(\frac{n}{K\, T_i(t-1)}\Big)}{T_i(t-1)}},$$

where $\hat{\mu}_{i,s} = \frac{1}{s} \sum_{t=1}^{s} X_{i,t}$ and $\log_+(x) = \log(x)\, \mathbb{1}_{x \geq 1}$. In the following we denote
$\delta_0 = 2\sqrt{K/n}$. From now on we work conditionally on $\theta$ and thus we drop all the dependency on $\theta$.

¹ Note however that the result of Agrawal and Goyal [2012b] applies to the individual regret (for $\theta$ fixed) while the
result of Russo and Roy [2013] only applies to the integrated Bayesian regret.
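
To make the MOSS-style index $B_{i,t}$ concrete, here is a small sketch that evaluates it from the empirical mean of an arm and its number of pulls. The function names `log_plus` and `moss_index` are hypothetical, and returning infinity for an arm that has never been pulled is a convention assumed here; the text does not specify that case.

```python
import math


def log_plus(x):
    """log_+(x) = log(x) if x >= 1, and 0 otherwise."""
    return math.log(x) if x >= 1.0 else 0.0


def moss_index(mean_hat, pulls, n, K):
    """Index B_{i,t} for an arm with empirical mean mean_hat after `pulls` pulls.

    n is the horizon and K the number of arms. The value for pulls == 0 is an
    assumed convention (force at least one pull), not taken from the text.
    """
    if pulls == 0:
        return math.inf
    bonus = math.sqrt(log_plus(n / (K * pulls)) / pulls)
    return mean_hat + bonus
```

Note that the exploration bonus vanishes as soon as an arm has been pulled at least $n/K$ times, which is the feature of the MOSS index that the proof exploits to avoid the logarithmic factor.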

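Step 2: control of $\mathbb{E}\big(\mu_{i^*(\theta)}(\theta) - B_{i^*(\theta),t} \mid \theta\big)$. By a simple integration of the deviations one
has

$$\mathbb{E}(\mu_{i^*} - B_{i^*,t}) \leq \delta_0 + \int_{\delta_0}^{1} \mathbb{P}(\mu_{i^*} - B_{i^*,t} \geq u)\, du.$$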



Next we extract the following inequality from Audibert and Bubeck [2010] (see p. 2683-2684):
for any $i \in [K]$,

$$\mathbb{P}(\mu_i - B_{i,t} \geq u) \leq \frac{4K}{n u^2} \log\Big(\sqrt{\frac{n}{K}}\, u\Big) + \frac{1}{n u^2 / K - 1}.$$



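Now an elementary integration gives

$$\int_{\delta_0}^{1} \frac{4K}{n u^2} \log\Big(\sqrt{\frac{n}{K}}\, u\Big) du = \Big[ -\frac{4K}{n u} \log\Big(e \sqrt{\frac{n}{K}}\, u\Big) \Big]_{\delta_0}^{1} \leq \frac{4K}{n \delta_0} \log\Big(e \sqrt{\frac{n}{K}}\, \delta_0\Big) = 2(1 + \log 2) \sqrt{\frac{K}{n}},$$

and

$$\int_{\delta_0}^{1} \frac{1}{n u^2 / K - 1}\, du = \Big[ -\frac{1}{2} \sqrt{\frac{K}{n}} \log\Big( \frac{\sqrt{n/K}\, u + 1}{\sqrt{n/K}\, u - 1} \Big) \Big]_{\delta_0}^{1} \leq \frac{\log 3}{2} \sqrt{\frac{K}{n}}.$$

Thus we proved:

$$\mathbb{E}\big(\mu_{i^*(\theta)}(\theta) - B_{i^*(\theta),t} \mid \theta\big) \leq \Big(2 + 2(1 + \log 2) + \frac{\log 3}{2}\Big) \sqrt{\frac{K}{n}} \leq 6 \sqrt{\frac{K}{n}}.$$
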
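
The constants of Step 2 can be sanity-checked numerically. The sketch below integrates the two terms of the deviation bound with a plain midpoint rule for one choice of $n$ and $K$, and reports the results in units of $\sqrt{K/n}$. The helper names, the midpoint rule, and the example values of $n$ and $K$ are illustrative assumptions; the check presumes $n \geq 4K$ so that $\delta_0 \leq 1$.

```python
import math


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (j + 0.5) * h) for j in range(steps))


def step2_check(n, K):
    """Return (log term, fraction term, total) in units of sqrt(K / n)."""
    r = math.sqrt(n / K)
    delta0 = 2.0 / r  # delta_0 = 2 sqrt(K / n)
    log_term = midpoint_integral(
        lambda u: 4.0 * K / (n * u * u) * math.log(r * u), delta0, 1.0
    )
    frac_term = midpoint_integral(
        lambda u: 1.0 / (n * u * u / K - 1.0), delta0, 1.0
    )
    total = delta0 + log_term + frac_term
    return log_term * r, frac_term * r, total * r
```

Calling `step2_check(10_000, 10)` returns three numbers that can be compared with $2(1 + \log 2)$, $\frac{\log 3}{2}$ and $6$ respectively.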
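Step 3: control of $\sum_{t=1}^{n} \mathbb{E}\big(B_{I_t,t} - \mu_{I_t}(\theta) \mid \theta\big)$. We start again by integrating the deviations:

$$\mathbb{E} \sum_{t=1}^{n} (B_{I_t,t} - \mu_{I_t}) \leq \delta_0 n + \int_{\delta_0}^{+\infty} \sum_{t=1}^{n} \mathbb{P}(B_{I_t,t} - \mu_{I_t} \geq u)\, du.$$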



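Next we use the following simple inequality:

$$\sum_{t=1}^{n} \mathbb{1}\{B_{I_t,t} - \mu_{I_t} \geq u\} \leq \sum_{s=1}^{n} \sum_{i=1}^{K} \mathbb{1}\Big\{ \hat{\mu}_{i,s} + \sqrt{\frac{\log_+\!\big(\frac{n}{Ks}\big)}{s}} - \mu_i \geq u \Big\},$$

which implies

$$\sum_{t=1}^{n} \mathbb{P}(B_{I_t,t} - \mu_{I_t} \geq u) \leq \sum_{i=1}^{K} \sum_{s=1}^{n} \mathbb{P}\Big( \hat{\mu}_{i,s} + \sqrt{\frac{\log_+\!\big(\frac{n}{Ks}\big)}{s}} - \mu_i \geq u \Big).$$

Now for $u \geq \delta_0$ let $s(u) = \lceil 3 \log\big(\frac{n u^2}{K}\big) / u^2 \rceil$, where $\lceil x \rceil$ is the smallest integer larger than $x$.
Let $c = 1 - \frac{1}{\sqrt{3}}$. It is easy to see that one has:

$$\sum_{s=1}^{n} \mathbb{P}\Big( \hat{\mu}_{i,s} + \sqrt{\frac{\log_+\!\big(\frac{n}{Ks}\big)}{s}} - \mu_i \geq u \Big) \leq \frac{3 \log\big(\frac{n u^2}{K}\big)}{u^2} + \sum_{s=s(u)}^{n} \mathbb{P}(\hat{\mu}_{i,s} - \mu_i \geq c u).$$
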
Using an integration already done in Step 2 we have

$$\int_{\delta_0}^{+\infty} \frac{3 \log\big(\frac{n u^2}{K}\big)}{u^2}\, du = 3(1 + \log 2) \sqrt{\frac{n}{K}} \leq 5.1 \sqrt{\frac{n}{K}}.$$

Next, using Hoeffding's inequality and the fact that the rewards are in $[0, 1]$, one has for $u \geq \delta_0$

$$\sum_{s=s(u)}^{n} \mathbb{P}(\hat{\mu}_{i,s} - \mu_i \geq c u) \leq \sum_{s=s(u)}^{n} \exp(-2 s c^2 u^2)\, \mathbb{1}_{u \leq 1/c} \leq \frac{\exp(-12 c^2 \log 2)}{1 - \exp(-2 c^2 u^2)}\, \mathbb{1}_{u \leq 1/c}.$$


Now using that $1 - \exp(-x) \geq x - x^2/2$ for $x \geq 0$ one obtains

$$\begin{aligned}
\int_{\delta_0}^{1/c} \frac{1}{1 - \exp(-2 c^2 u^2)}\, du
&= \int_{\delta_0}^{1/(2c)} \frac{1}{1 - \exp(-2 c^2 u^2)}\, du + \int_{1/(2c)}^{1/c} \frac{1}{1 - \exp(-2 c^2 u^2)}\, du \\
&\leq \int_{\delta_0}^{1/(2c)} \frac{1}{2 c^2 u^2 - 2 c^4 u^4}\, du + \frac{1}{2c(1 - \exp(-1/2))} \\
&\leq \int_{\delta_0}^{1/(2c)} \frac{2}{3 c^2 u^2}\, du + \frac{1}{2c(1 - \exp(-1/2))} \\
&= \frac{2}{3 c^2 \delta_0} - \frac{4}{3c} + \frac{1}{2c(1 - \exp(-1/2))} \\
&\leq 1.9 \sqrt{\frac{n}{K}}.
\end{aligned}$$



Putting the pieces together we proved

$$\mathbb{E} \sum_{t=1}^{n} (B_{I_t,t} - \mu_{I_t}) \leq 7.6 \sqrt{nK},$$

which concludes the proof together with the results of Step 1 and Step 2.
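
The arithmetic behind the final constant can be checked with a few lines of code. The sketch below evaluates, in units of $\sqrt{nK}$, the three contributions to the Step 3 bound (the term $\delta_0 n$, the integral of the logarithmic term summed over the $K$ arms, and the Hoeffding tail integral summed over the $K$ arms) and adds the Step 2 contribution summed over the $n$ rounds. The function name and the returned dictionary keys are hypothetical, and this bookkeeping is one reading of how the pieces combine, not a quotation of the original computation.

```python
import math


def regret_constants():
    """Evaluate the numerical constants of Steps 2 and 3 in units of sqrt(n K)."""
    c = 1.0 - 1.0 / math.sqrt(3.0)

    # Step 2, summed over n rounds: n * const * sqrt(K / n) = const * sqrt(n K).
    step2 = 2.0 + 2.0 * (1.0 + math.log(2.0)) + math.log(3.0) / 2.0

    # Step 3, first term: delta_0 * n = 2 sqrt(n K).
    first = 2.0
    # Step 3, log term: K * 3 (1 + log 2) sqrt(n / K).
    log_part = 3.0 * (1.0 + math.log(2.0))
    # Step 3, Hoeffding tail: prefactor times the integral bound.
    prefactor = math.exp(-12.0 * c * c * math.log(2.0))
    leading = 1.0 / (3.0 * c * c)  # 2 / (3 c^2 delta_0) in units of sqrt(n / K)
    remainder = -4.0 / (3.0 * c) + 1.0 / (2.0 * c * (1.0 - math.exp(-0.5)))
    # The remainder is negative, so dropping it only loosens the bound.
    step3 = first + log_part + prefactor * leading

    return {
        "step2": step2,
        "step3": step3,
        "remainder": remainder,
        "total": step2 + step3,
    }
```

Evaluating `regret_constants()` lets one compare `step2` with 6, `step3` with 7.6, and `total` with the constant 14 of Theorem 1.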